As a data scientist you will also work with ordinal and categorical data. What are good methods to visualize such data? What are proper statistics to use? What needs to be done in the data preparation steps to conduct a certain analysis? During this week we will work with data from a sleep study. This dataset is not mandatory: you are encouraged to use data from your project when possible.
Keywords: statistics, categorical data, ordinal data, survey-based sleep study, exploratory data analysis, normalization, hypothesis testing, p-value
More to read:
https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/ There are many cheat sheets and tutorials on the internet; the link above offers a compact overview.
You will learn to analyse data with pandas and numpy and to visualize it with bokeh. Concretely, you will preprocess the Sleep Study data into an appropriate format in order to conduct statistical and visual analysis.
Learning objectives
Please add topics you want to learn here: https://padlet.com/ffeenstra1/z9duo25d39dcgezz
The data is collected from a survey-based study of the sleeping habits of individuals within the US.
Below is a description of each of the variables contained within the dataset.
The two research questions you should answer in this assignment are:
The assignment consists of 6 parts:
Parts 1 to 5 are mandatory, part 6 is optional (bonus). Note that you cannot copy code without referencing it. If you copy code you need to be able to explain it verbally, and you will not get the full score.
NOTE If your project data is suitable you can use that data instead of the given data
Analysis of variance (ANOVA) compares the variance between groups to the variance within groups. It essentially determines whether the differences between groups are larger than the differences within a group (the noise). A graph illustrating this is at: https://link.springer.com/article/10.1007/s00424-019-02300-4/figures/2
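This between-versus-within idea can be made concrete numerically: the F statistic is the ratio of the between-group mean square to the within-group mean square. A minimal sketch on made-up values, checked against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# three illustrative groups (values are made up)
groups = [np.array([7.0, 8.0, 7.5]),
          np.array([6.0, 6.5, 5.5]),
          np.array([5.0, 4.5, 5.5])]

grand_mean = np.concatenate(groups).mean()
k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations

# between-group mean square: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# within-group mean square: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n - k)

f_manual = ms_between / ms_within
f_scipy, _ = stats.f_oneway(*groups)
print(f'manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}')
```

The two F values agree, which shows that `f_oneway` is doing exactly this mean-square ratio under the hood.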
In ANOVA, the dependent variable must be a continuous (interval or ratio) level of measurement. For instance Glucose level. The independent variables in ANOVA must be categorical (nominal or ordinal) variables. For instance trial category, time of day (AM versus PM) or time of trial (different categories). Like the t-test, ANOVA is also a parametric test and has some assumptions. ANOVA assumes that the data is normally distributed. The ANOVA also assumes homogeneity of variance, which means that the variance among the groups should be approximately equal. ANOVA also assumes that the observations are independent of each other.
A one-way ANOVA has just one independent variable. A two-way ANOVA (also called factorial ANOVA) uses two independent variables. For research question 1 we can use a one-way ANOVA; for research question 2 a two-way ANOVA. But first we need to check the assumptions.
If your data is not normally distributed you might want to look for an alternative. See also https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/parametric-and-non-parametric-data/
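One common non-parametric alternative to the one-way ANOVA is the Kruskal-Wallis H-test, which does not assume normality. A minimal sketch on illustrative (made-up) group values:

```python
from scipy import stats

# three hypothetical groups of sleep hours (illustrative values only)
group_a = [7.0, 8.0, 6.5, 7.5, 8.0]
group_b = [6.0, 5.5, 6.5, 6.0, 7.0]
group_c = [5.0, 4.5, 6.0, 5.5, 5.0]

# Kruskal-Wallis is rank-based: no normality assumption needed
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f'H = {h_stat:.2f}, p = {p_value:.4f}')
```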
Load the sleep.csv data.
Preferably we read the data not with a hard coded data path but using a config file. See https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml. Get yourself familiar with the data. Answer the following questions.
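A minimal `config.yaml` for this could look as follows; the path below is only an example, point it at wherever your copy of `sleep.csv` lives:

```yaml
# example config.yaml; adjust the path to your own setup
file_assignment_4b: data/sleep.csv
```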
import yaml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
with open('config.yaml') as stream:
config = yaml.safe_load(stream)
df = pd.read_csv(config['file_assignment_4b'])
df.head()
| Enough | Hours | PhoneReach | PhoneTime | Tired | Breakfast | |
|---|---|---|---|---|---|---|
| 0 | Yes | 8.0 | Yes | Yes | 3 | Yes |
| 1 | No | 6.0 | Yes | Yes | 3 | No |
| 2 | Yes | 6.0 | Yes | Yes | 2 | Yes |
| 3 | No | 7.0 | Yes | Yes | 4 | No |
| 4 | No | 7.0 | Yes | Yes | 2 | Yes |
# code printing percentage missing data
for column in df.columns:
    print(f'Column "{column}" has {df[column].isna().sum()} missing datapoints')
print('\nMissing datapoints below:')
print(df[df['Hours'].isna()])
# compute the count from the data rather than hard-coding it
n_missing = df['Hours'].isna().sum()
print(f'The percentage of missing data is: {(n_missing/len(df))*100:.2f}%')
Column "Enough" has 0 missing datapoints
Column "Hours" has 2 missing datapoints
Column "PhoneReach" has 0 missing datapoints
Column "PhoneTime" has 0 missing datapoints
Column "Tired" has 0 missing datapoints
Column "Breakfast" has 0 missing datapoints

Missing datapoints below:
   Enough  Hours PhoneReach PhoneTime  Tired Breakfast
65     No    NaN        Yes        No      3       Yes
91    Yes    NaN         No       Yes      2       Yes
The percentage of missing data is: 1.92%
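With only about 2% of the rows incomplete, simply dropping the two rows with a missing `Hours` value is a defensible choice. A sketch on a hypothetical miniature of the sleep data:

```python
import pandas as pd

# hypothetical frame mirroring the sleep data, with one missing Hours value
df = pd.DataFrame({'Hours': [8.0, None, 6.0], 'Tired': [3, 2, 4]})

# drop rows where Hours is missing and renumber the index
df_clean = df.dropna(subset=['Hours']).reset_index(drop=True)
print(len(df_clean))
```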
df.groupby('Tired').describe()
Statistics of Hours per Tired group:

| Tired | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 8.000000 | 0.816497 | 7.0 | 7.75 | 8.0 | 8.25 | 9.0 |
| 2 | 26.0 | 6.653846 | 1.468123 | 4.0 | 6.00 | 7.0 | 7.00 | 9.0 |
| 3 | 39.0 | 6.717949 | 0.971941 | 5.0 | 6.00 | 7.0 | 7.00 | 9.0 |
| 4 | 23.0 | 6.739130 | 1.251086 | 5.0 | 6.00 | 7.0 | 7.00 | 10.0 |
| 5 | 10.0 | 5.700000 | 2.584140 | 2.0 | 4.25 | 5.5 | 7.75 | 9.0 |
#code printing answer dependent and independent variables
print("""The independent variables are: Breakfast and Tiredness
The dependent variable is Hours (of sleep).""")
The independent variables are: Breakfast and Tiredness
The dependent variable is Hours (of sleep).
#code printing answer about datatypes
print(df.info())
print('''
In my opinion the datatypes of the columns make sense, however
when you want to investigate correlations this does not work on
string objects. For this, dummy variables can be created''')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Enough 104 non-null object
1 Hours 102 non-null float64
2 PhoneReach 104 non-null object
3 PhoneTime 104 non-null object
4 Tired 104 non-null int64
5 Breakfast 104 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 5.0+ KB
None
In my opinion the datatypes of the columns make sense, however
when you want to investigate correlations this does not work on
string objects. For this, dummy variables can be created
Inspect the data practically. Get an idea of how well the variable categories are balanced. Are the values of a variable equally divided? What is the mean value of the dependent variable? Are there correlations among the variables?
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.transform import jitter
output_notebook()
# code your answer to the value counts and distribution plots here
vc_tired = df['Tired'].value_counts()
vc_hours = df['Hours'].value_counts()
p1 = figure(title='Tired, value counts', x_axis_label='Tired score (1 - 5)',
y_axis_label = 'counts')
p1.vbar(x=vc_tired.index, top=vc_tired.values, width=0.5)
p2 = figure(title='Hours slept, value counts', x_axis_label='Sleep (hours)',
y_axis_label = 'counts')
p2.vbar(x=vc_hours.index, top=vc_hours.values, width=0.5)
show(gridplot(children=[[p1,p2]], width = 400, height = 400))
p = figure(title = 'Hours slept against tiredness',
y_axis_label = 'Hours slept (Hours)',
x_axis_label = 'Tired (Score, 1-5)')
# breakfast = Yes
p.circle(y=jitter('Hours',width=0.2), x=jitter('Tired', width=0.3),
source = df[df['Breakfast'] == 'Yes'],
color = 'orange', legend_label = 'Breakfast', alpha=0.5, size=10)
# breakfast = No
p.circle(y=jitter('Hours',width=0.2), x=jitter('Tired', width=0.3),
source = df[df['Breakfast'] == 'No'],
color = 'green', legend_label = 'No Breakfast', alpha=0.5, size=10)
show(p)
sns.kdeplot(x='Hours', hue='Tired', data=df, fill = True, palette = 'Set2',)
plt.xlim((1,10))
sns.kdeplot(x=df['Tired'], hue=df['Breakfast'], fill = True, palette='Set1')
plt.xlim(1,5)
plt.xticks([1,2,3,4,5]);  # trailing ; suppresses the text output of the last call
Below are the same plots as above made with hvPlot (HoloViews), just as an exercise and to see the difference between plotting tools.
import hvplot.pandas
df[['Hours','Tired']].hvplot.kde(by='Tired',
legend = 'top_left',
xlim=(1,10),
xticks=list(range(1,11)),
title = 'Density of hours slept stratified by tired score')
import hvplot.pandas
df[['Breakfast','Tired']].hvplot.kde(by='Breakfast',
legend = 'top_left',
xlim=(1,5),
xticks=[1,2,3,4,5],
title = 'Breakfast density given tired score')
# statistics
df_dummies = pd.get_dummies(df, drop_first=True)
#code your answer for the heatmap here and briefly state your finding
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(df_dummies.corr(), cmap='viridis', annot=True)
print("""
There seems to be a relatively strong negative correlation between
'Enough_Yes' and 'Tired', surprisingly there doesn't seem to be
a strong correlation between tired and hours.
""")
There seems to be a relatively strong negative correlation between
'Enough_Yes' and 'Tired', surprisingly there doesn't seem to be
a strong correlation between tired and hours.
Before we answer the research question with ANOVA we need to check the following assumptions: the data should be normally distributed and the variances among the groups should be approximately equal.
We are going to do this graphically and statistically.
df['Hours'].value_counts()
7.0     35
6.0     24
8.0     16
5.0     12
9.0      8
4.0      4
2.0      2
10.0     1
Name: Hours, dtype: int64
# sort by hours so the line traces the distribution from left to right
p = figure(title='Distribution of hours slept', x_axis_label='Hours', y_axis_label='counts')
vc_sorted = vc_hours.sort_index()
p.line(x=vc_sorted.index, y=vc_sorted.values)
show(p)
# briefly summarize your findings
# your code for the statistical test here
# briefly summarize your findings
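Normality can be tested statistically with the Shapiro-Wilk test, and homogeneity of variance with Levene's test. A sketch on simulated data (the values below are generated, not from the sleep study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# simulated sleep hours for two groups (illustrative only)
group_a = rng.normal(loc=7.0, scale=1.0, size=30)
group_b = rng.normal(loc=6.5, scale=1.0, size=30)

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
w, p_norm = stats.shapiro(group_a)

# Levene: H0 = the groups have equal variances
l_stat, p_var = stats.levene(group_a, group_b)

print(f'Shapiro p = {p_norm:.3f}, Levene p = {p_var:.3f}')
```

A p-value below the chosen significance level (commonly 0.05) would reject the corresponding assumption.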
Create a dataframe with equal sample sizes. Make three categories for tiredness: 1-2 = no, 3 = maybe, 4-5 = yes.
equal_df = df
equal_df
| Enough | Hours | PhoneReach | PhoneTime | Tired | Breakfast | |
|---|---|---|---|---|---|---|
| 0 | Yes | 8.0 | Yes | Yes | 3 | Yes |
| 1 | No | 6.0 | Yes | Yes | 3 | No |
| 2 | Yes | 6.0 | Yes | Yes | 2 | Yes |
| 3 | No | 7.0 | Yes | Yes | 4 | No |
| 4 | No | 7.0 | Yes | Yes | 2 | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | No | 7.0 | Yes | Yes | 2 | Yes |
| 100 | No | 7.0 | No | Yes | 3 | Yes |
| 101 | Yes | 8.0 | Yes | Yes | 3 | Yes |
| 102 | Yes | 7.0 | Yes | Yes | 2 | Yes |
| 103 | Yes | 6.0 | Yes | Yes | 3 | Yes |
104 rows × 6 columns
equal_df['Tired'] = equal_df['Tired'].map({1:'no', 2:'no',
3:'maybe',
4:'yes', 5:'yes'})
equal_df
| Enough | Hours | PhoneReach | PhoneTime | Tired | Breakfast | |
|---|---|---|---|---|---|---|
| 0 | Yes | 8.0 | Yes | Yes | maybe | Yes |
| 1 | No | 6.0 | Yes | Yes | maybe | No |
| 2 | Yes | 6.0 | Yes | Yes | no | Yes |
| 3 | No | 7.0 | Yes | Yes | yes | No |
| 4 | No | 7.0 | Yes | Yes | no | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | No | 7.0 | Yes | Yes | no | Yes |
| 100 | No | 7.0 | No | Yes | maybe | Yes |
| 101 | Yes | 8.0 | Yes | Yes | maybe | Yes |
| 102 | Yes | 7.0 | Yes | Yes | no | Yes |
| 103 | Yes | 6.0 | Yes | Yes | maybe | Yes |
104 rows × 6 columns
df
| Enough | Hours | PhoneReach | PhoneTime | Tired | Breakfast | |
|---|---|---|---|---|---|---|
| 0 | Yes | 8.0 | Yes | Yes | maybe | Yes |
| 1 | No | 6.0 | Yes | Yes | maybe | No |
| 2 | Yes | 6.0 | Yes | Yes | no | Yes |
| 3 | No | 7.0 | Yes | Yes | yes | No |
| 4 | No | 7.0 | Yes | Yes | no | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | No | 7.0 | Yes | Yes | no | Yes |
| 100 | No | 7.0 | No | Yes | maybe | Yes |
| 101 | Yes | 8.0 | Yes | Yes | maybe | Yes |
| 102 | Yes | 7.0 | Yes | Yes | no | Yes |
| 103 | Yes | 6.0 | Yes | Yes | maybe | Yes |
104 rows × 6 columns
Note: `equal_df = df` does not copy the dataframe; both names refer to the same object, so mapping `equal_df['Tired']` also changed `df`. Using `df.copy()` gives an independent copy and avoids this.
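A sketch of building a balanced frame without mutating the original; the mini dataframe below is illustrative, and `groupby(...).sample` requires pandas ≥ 1.1:

```python
import pandas as pd

# hypothetical miniature of the sleep data
df = pd.DataFrame({
    'Hours': [8, 6, 6, 7, 7, 5, 9, 4],
    'Tired': [1, 2, 3, 4, 5, 3, 2, 4],
})

# .copy() gives an independent frame, so the mapping does not touch df
equal_df = df.copy()
equal_df['Tired'] = equal_df['Tired'].map(
    {1: 'no', 2: 'no', 3: 'maybe', 4: 'yes', 5: 'yes'})

# downsample every category to the size of the smallest one
n = equal_df['Tired'].value_counts().min()
balanced = equal_df.groupby('Tired').sample(n=n, random_state=1)
print(balanced['Tired'].value_counts())
```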
from scipy import stats
# scipy has no stats.anova(); the one-way ANOVA is stats.f_oneway
groups = [g['Hours'].dropna() for _, g in equal_df.groupby('Tired')]
f_stat, p_value = stats.f_oneway(*groups)
Create a panel with 1) your dataframe with equal sample sizes, 2) a picture of a sleeping beauty, 3) the scatter plot of tired / hours of sleep with different colors for Breakfast from part 2, and 4) the boxplots, with the p-value of the ANOVA outcome in the title.
#your solution here